Reverberation Reduction for Improved Speech Recognition
Authors
Abstract
In this paper we present a dereverberation algorithm that improves automatic speech recognition (ASR) results with minimal CPU overhead. Because the reverberation tail hurts ASR the most, late reverberation is reduced via gain-based spectral subtraction. We use a multi-band decay model with an efficient method for updating it in real time. In reverberant environments the multi-channel version of the proposed algorithm moves the word error rate (WER) up to halfway from that of a microphone array alone toward that of a close-talk microphone. The four-channel implementation requires less than 2% of the CPU power of a modern computer.
Introduction
The need to present clean sound inputs to today's speech recognition engines has fostered a large body of research on noise suppression, microphone array processing, acoustic echo cancellation, and methods for reducing the effects of acoustic reverberation.
Reducing reverberation through deconvolution (inverse filtering) is one of the most common approaches. The main problem is that the channel must be known or very well estimated for deconvolution to succeed. The estimation is done in the cepstral domain [1] or on envelope levels [2]. Multi-channel variants use the redundancy of the channel signals [3] and frequently work in the cepstral domain [4].
Blind dereverberation methods seek to estimate the input(s) to the system without explicitly computing a deconvolution or inverse filter. Most of them employ probabilistic and statistical models [5].
Dereverberation via suppression and enhancement is similar to noise suppression. These algorithms try to suppress the reverberation, enhance the direct-path speech, or both. They estimate neither the channel nor the signal. The usual techniques are long-term cepstral mean subtraction [6], pitch enhancement [7], and LPC analysis [8], in single-channel or multi-channel implementations.
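The abstract's "halfway" figure can be read as the fraction of the WER gap between the array-only and close-talk conditions that the processing closes. The sketch below is only an illustration of that arithmetic; the function name and the WER values are hypothetical and are not results from the paper.

```python
def fraction_of_gap_closed(wer_array: float, wer_processed: float,
                           wer_close_talk: float) -> float:
    """Fraction of the array-only vs. close-talk WER gap closed by processing."""
    gap = wer_array - wer_close_talk
    if gap <= 0:
        raise ValueError("array-only WER must exceed close-talk WER")
    return (wer_array - wer_processed) / gap


# Hypothetical WERs in percent (illustrative only, not taken from the paper).
print(fraction_of_gap_closed(wer_array=20.0, wer_processed=14.0,
                             wer_close_talk=8.0))
```

A value of 0.5 corresponds to the "one half of the way" case; 1.0 would mean the processed array matches the close-talk microphone.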
The most common issues with the preceding methods are slow reaction when the reverberation changes, limited robustness to noise, and high computational requirements.
Modeling and assumptions
We convolved a clean speech signal with a typical room response function, truncated the response function after a given point, and processed the result through our ASR engine. The results are shown in Figure 1. Early reverberation has practically no effect on the ASR results, most probably because of cepstral mean subtraction (CMS) in the ASR engine front end: CMS compensates for the constant part of the input channel response and removes the early reverberation. Reverberation has a noticeable effect on WER between 50 ms and RT30. In this time interval the reverberation behaves more like non-stationary, uncorrelated decaying noise $R(f)$:
$$Y(f) = X(f) + R(f) \qquad (1)$$
We assume that the reverberation energy in this time interval decays exponentially and is the same at every point in the room (i.e. it is diffuse). Our decay model is frequency dependent:
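As a rough illustration of how a per-band exponential decay can drive gain-based spectral subtraction under the additive model (1), here is a minimal NumPy sketch. It is not the paper's model or update rule: the function name `late_reverb_gain`, the per-band RT60 parameterization, the 50 ms onset of late reverberation taken as a fixed delay, and the gain floor are all assumptions made for this example.

```python
import numpy as np


def late_reverb_gain(power, rt60, frame_s, delay_s=0.05, floor=0.1):
    """Per-band suppression gain for late reverberation (illustrative sketch).

    power   : (frames, bands) short-time power spectrum |Y|^2
    rt60    : (bands,) assumed 60 dB decay time per band, in seconds
    frame_s : frame step in seconds
    delay_s : assumed onset of late reverberation (50 ms here)
    floor   : assumed lower bound on the gain
    """
    power = np.asarray(power, dtype=float)
    rt60 = np.asarray(rt60, dtype=float)
    delay_frames = max(1, int(round(delay_s / frame_s)))

    # Exponential energy decay: 60 dB drop over rt60 seconds in each band.
    decay = 10.0 ** (-6.0 * delay_frames * frame_s / rt60)

    # Late-reverberation power estimate: past frames scaled by the decay.
    late = np.zeros_like(power)
    late[delay_frames:] = decay * power[:-delay_frames]

    # Power spectral subtraction expressed as a gain, limited to [floor, 1].
    gain = 1.0 - late / np.maximum(power, 1e-12)
    return np.clip(gain, floor, 1.0)


# Synthetic check: a pure exponentially decaying tail in two bands.
frame_s = 0.01
rt60 = np.array([0.6, 0.3])
n = np.arange(40)[:, None]
power = 10.0 ** (-6.0 * n * frame_s / rt60)
gain = late_reverb_gain(power, rt60, frame_s)
print(gain[:5].min(), gain[5:].max())
```

On this synthetic input the first frames, which have no earlier energy to explain them, keep unit gain, while frames that are fully explained by the decayed past are pushed down to the floor. In practice the gain would multiply the complex spectrum of each frame before resynthesis.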
Similar resources
Improved Bayesian Training for Context-Dependent Modeling in Continuous Persian Speech Recognition
Context-dependent modeling is a widely used technique for better phone modeling in continuous speech recognition. While different types of context-dependent models have been used, triphones have been known as the most effective ones. In this paper, a Maximum a Posteriori (MAP) estimation approach has been used to estimate the parameters of the untied triphone model set used in data-driven clust...
Convolutional Neural Network with Adaptive Windows for Speech Recognition
Although speech recognition systems are widely used and their accuracy is continuously increasing, there is a considerable gap between their performance and human recognition ability. This is partially due to high speaker variation in the speech signal. Deep neural networks are among the best tools for acoustic modeling. Recently, using hybrid deep neural network and hidden Markov mo...
A Database for Automatic Persian Speech Emotion Recognition: Collection, Processing and Evaluation
Recent developments in robotics automation have motivated researchers to improve the efficiency of interactive systems by making a natural man-machine interaction. Since speech is the most popular method of communication, recognizing human emotions from speech signal becomes a challenging research topic known as Speech Emotion Recognition (SER). In this study, we propose a Persian em...
Off-line Arabic Handwritten Recognition Using a Novel Hybrid HMM-DNN Model
In order to facilitate the entry of data into the computer and its digitization, automatic recognition of printed texts and manuscripts is a considerable aid to many applications. Research on automatic document recognition started decades ago with the recognition of isolated digits and letters, and today, due to advancements in machine learning methods, efforts are being made to iden...
An evaluation of adaptive beamformer based on average speech spectrum for noisy speech recognition
Distant-talking speech recognition in noisy environments is indispensable for self-moving robots or tele-conference systems. However, background noise and room reverberations seriously degrade the sound-capture quality in real acoustic environments. A microphone array is an ideal candidate as an effective method for capturing distant-talking speech. AMNOR (Adaptive Microphone-array for NOise Re...
Journal title:
Volume, issue:
Pages: -
Publication date: 2004